{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bccd47fc",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/postgres.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db0855d0",
   "metadata": {},
   "source": [
    "# Postgres Vector Store\n",
    "In this notebook we are going to show how to use [Postgresql](https://www.postgresql.org) and  [pgvector](https://github.com/pgvector/pgvector)  to perform vector searches in LlamaIndex"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e4f33fc9",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2fc9c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-postgres"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "712daea5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eadf6b8a",
   "metadata": {},
   "source": [
    "Running the following cell will install Postgres with PGVector in Colab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ab46b5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo apt update\n",
    "!echo | sudo apt install -y postgresql-common\n",
    "!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh\n",
    "!echo | sudo apt install postgresql-15-pgvector\n",
    "!sudo service postgresql start\n",
    "!sudo -u postgres psql -c \"ALTER USER postgres PASSWORD 'password';\"\n",
    "!sudo -u postgres psql -c \"CREATE DATABASE vector_db;\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d1c538",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import logging\n",
    "# import sys\n",
    "\n",
    "# Uncomment to see debug logs\n",
    "# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n",
    "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "from llama_index.core import SimpleDirectoryReader, StorageContext\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.vector_stores.postgres import PGVectorStore\n",
    "import textwrap"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26c71b6d",
   "metadata": {},
   "source": [
    "### Setup OpenAI\n",
    "The first step is to configure the openai key. It will be used to created embeddings for the documents loaded into the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67b86621",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eecf4bd5",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6df9fa89",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
   "metadata": {},
   "source": [
    "### Loading documents\n",
    "Load the documents stored in the `data/paul_graham/` using the SimpleDirectoryReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c154dd4b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document ID: 56e70c8c-0fb7-4250-99be-b953d0185a01\n"
     ]
    }
   ],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()\n",
    "print(\"Document ID:\", documents[0].doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bd24f0a",
   "metadata": {},
   "source": [
    "### Create the Database\n",
    "Using an existing postgres running at localhost, create the database we'll be using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6d61e73",
   "metadata": {},
   "outputs": [],
   "source": [
    "import psycopg2\n",
    "\n",
    "connection_string = \"postgresql://postgres:password@localhost:5432\"\n",
    "db_name = \"vector_db\"\n",
    "conn = psycopg2.connect(connection_string)\n",
    "conn.autocommit = True\n",
    "\n",
    "with conn.cursor() as c:\n",
    "    c.execute(f\"DROP DATABASE IF EXISTS {db_name}\")\n",
    "    c.execute(f\"CREATE DATABASE {db_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0232fd1",
   "metadata": {},
   "source": [
    "### Create the index\n",
    "Here we create an index backed by Postgres using the documents loaded previously. PGVectorStore takes a few arguments. The example below constructs a PGVectorStore with a HNSW index with m = 16, ef_construction = 64, and ef_search = 40, with the `vector_cosine_ops` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8731da62",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "baff367dece9412a947e9e957b08ddea",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9b113f3dad2f4b1aa07e217d4046d85b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:47:21,725 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    }
   ],
   "source": [
    "from sqlalchemy import make_url\n",
    "\n",
    "url = make_url(connection_string)\n",
    "vector_store = PGVectorStore.from_params(\n",
    "    database=db_name,\n",
    "    host=url.host,\n",
    "    password=url.password,\n",
    "    port=url.port,\n",
    "    user=url.username,\n",
    "    table_name=\"paul_graham_essay\",\n",
    "    embed_dim=1536,  # openai embedding dimension\n",
    "    hnsw_kwargs={\n",
    "        \"hnsw_m\": 16,\n",
    "        \"hnsw_ef_construction\": 64,\n",
    "        \"hnsw_ef_search\": 40,\n",
    "        \"hnsw_dist_method\": \"vector_cosine_ops\",\n",
    "    },\n",
    ")\n",
    "\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, show_progress=True\n",
    ")\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ee4473a-094f-4d0a-a825-e1213db07240",
   "metadata": {},
   "source": [
    "### Query the index\n",
    "We can now ask questions using our index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a2bcc07",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:47:30,412 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
      "2025-09-11 16:47:31,665 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"What did the author do?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cf55bf7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author worked on writing essays, programming, building microcomputers, predicting rocket\n",
      "heights, developing a word processor, and giving talks on starting a startup.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68cbd239-880e-41a3-98d8-dbb3fab55431",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:47:37,531 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
      "2025-09-11 16:47:38,352 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"What happened in the mid 1980s?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdf5287f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AI was in the air in the mid 1980s, and two things that influenced the desire to work on it were a\n",
      "novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called\n",
      "Mike, and a PBS documentary that showed Terry Winograd using SHRDLU.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3bed9e1",
   "metadata": {},
   "source": [
    "### Querying existing index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6b2634b",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = PGVectorStore.from_params(\n",
    "    database=\"vector_db\",\n",
    "    host=\"localhost\",\n",
    "    password=\"password\",\n",
    "    port=5432,\n",
    "    user=\"postgres\",\n",
    "    table_name=\"paul_graham_essay\",\n",
    "    embed_dim=1536,  # openai embedding dimension\n",
    "    hnsw_kwargs={\n",
    "        \"hnsw_m\": 16,\n",
    "        \"hnsw_ef_construction\": 64,\n",
    "        \"hnsw_ef_search\": 40,\n",
    "        \"hnsw_dist_method\": \"vector_cosine_ops\",\n",
    "    },\n",
    ")\n",
    "\n",
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7075af3-156e-4bde-8f76-6d9dee86861f",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\"What did the author do?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b088c090",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author worked on writing essays, programming, creating microcomputers, developing software,\n",
      "giving talks, and starting a startup.\n"
     ]
    }
   ],
   "source": [
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55745895-8f01-4275-abaa-b2ebef2cb4c7",
   "metadata": {},
   "source": [
    "### Hybrid Search\n",
    "\n",
    "To enable hybrid search, you need to:\n",
    "1. pass in `hybrid_search=True` when constructing the `PGVectorStore` (and optionally configure `text_search_config` with the desired language)\n",
    "2. pass in `vector_store_query_mode=\"hybrid\"` when constructing the query engine (this config is passed to the retriever under the hood). You can also optionally set the `sparse_top_k` to configure how many results we should obtain from sparse text search (default is using the same value as `similarity_top_k`). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65a7e133-39da-40c5-b2c5-7af2c0a3a792",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import make_url\n",
    "\n",
    "url = make_url(connection_string)\n",
    "hybrid_vector_store = PGVectorStore.from_params(\n",
    "    database=db_name,\n",
    "    host=url.host,\n",
    "    password=url.password,\n",
    "    port=url.port,\n",
    "    user=url.username,\n",
    "    table_name=\"paul_graham_essay_hybrid_search\",\n",
    "    embed_dim=1536,  # openai embedding dimension\n",
    "    hybrid_search=True,\n",
    "    text_search_config=\"english\",\n",
    "    hnsw_kwargs={\n",
    "        \"hnsw_m\": 16,\n",
    "        \"hnsw_ef_construction\": 64,\n",
    "        \"hnsw_ef_search\": 40,\n",
    "        \"hnsw_dist_method\": \"vector_cosine_ops\",\n",
    "    },\n",
    ")\n",
    "\n",
    "storage_context = StorageContext.from_defaults(\n",
    "    vector_store=hybrid_vector_store\n",
    ")\n",
    "hybrid_index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f8edee4-6c19-4d99-b602-110bdc5708e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "hybrid_query_engine = hybrid_index.as_query_engine(\n",
    "    vector_store_query_mode=\"hybrid\", sparse_top_k=2\n",
    ")\n",
    "hybrid_response = hybrid_query_engine.query(\n",
    "    \"Who does Paul Graham think of with the word schtick\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd454b25-b66c-4733-8ff4-24fb2ee84cec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Roy Lichtenstein\n"
     ]
    }
   ],
   "source": [
    "print(hybrid_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74ac0f05",
   "metadata": {},
   "source": [
    "#### Improving hybrid search with QueryFusionRetriever\n",
    "\n",
    "Since the scores for text search and vector search are calculated differently, the nodes that were found only by text search will have a much lower score.\n",
    "\n",
    "You can often improve hybrid search performance by using `QueryFusionRetriever`, which makes better use of the mutual information to rank the nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "345ba0fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.response_synthesizers import CompactAndRefine\n",
    "from llama_index.core.retrievers import QueryFusionRetriever\n",
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "\n",
    "vector_retriever = hybrid_index.as_retriever(\n",
    "    vector_store_query_mode=\"default\",\n",
    "    similarity_top_k=5,\n",
    ")\n",
    "text_retriever = hybrid_index.as_retriever(\n",
    "    vector_store_query_mode=\"sparse\",\n",
    "    similarity_top_k=5,  # interchangeable with sparse_top_k in this context\n",
    ")\n",
    "retriever = QueryFusionRetriever(\n",
    "    [vector_retriever, text_retriever],\n",
    "    similarity_top_k=5,\n",
    "    num_queries=1,  # set this to 1 to disable query generation\n",
    "    mode=\"relative_score\",\n",
    "    use_async=False,\n",
    ")\n",
    "\n",
    "response_synthesizer = CompactAndRefine()\n",
    "query_engine = RetrieverQueryEngine(\n",
    "    retriever=retriever,\n",
    "    response_synthesizer=response_synthesizer,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a88c84e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Paul Graham thinks of Roy Lichtenstein when using the word \"schtick\" because Lichtenstein's distinctive signature style in his paintings immediately identifies his work as his own.\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\n",
    "    \"Who does Paul Graham think of with the word schtick, and why?\"\n",
    ")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e5e8083",
   "metadata": {},
   "source": [
    "### Metadata filters\n",
    "\n",
    "PGVectorStore supports storing metadata in nodes, and filtering based on that metadata during the retrieval step."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d0ad3fc",
   "metadata": {},
   "source": [
    "#### Download git commits dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63e90a89",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/git_commits/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fef41f44",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<lakshmi@timescale.com>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache.  Fixes #6037 '}\n",
      "4167\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "with open(\"data/git_commits/commit_history.csv\", \"r\") as f:\n",
    "    commits = list(csv.DictReader(f))\n",
    "\n",
    "print(commits[0])\n",
    "print(len(commits))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b0d9f47",
   "metadata": {},
   "source": [
    "#### Add nodes with custom metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3920109b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Node ID: 9c2c2f17-d763-4ce8-bb02-83cb176008e4\n",
      "Text: Fix segfault in set_integer_now_func  When an invalid function\n",
      "oid is passed to set_integer_now_func, it finds out that the function\n",
      "oid is invalid but before throwing the error, it calls ReleaseSysCache\n",
      "on an invalid tuple causing a segfault. Fixed that by removing the\n",
      "invalid call to ReleaseSysCache.  Fixes #6037\n",
      "2023-03-22 to 2023-09-05\n",
      "{'konstantina@timescale.com', 'nikhil@timescale.com', 'satish.8483@gmail.com', 'mats@timescale.com', 'fabriziomello@gmail.com', 'erik@timescale.com', 'sven@timescale.com', 'lakshmi@timescale.com', 'dmitry@timescale.com', 'engel@sero-systems.de', 'rafia.sabih@gmail.com', '36882414+akuzm@users.noreply.github.com', 'jguthrie@timescale.com', 'jan@timescale.com', 'me@noctarius.com'}\n"
     ]
    }
   ],
   "source": [
    "# Create TextNode for each of the first 100 commits\n",
    "from llama_index.core.schema import TextNode\n",
    "from datetime import datetime\n",
    "import re\n",
    "\n",
    "nodes = []\n",
    "dates = set()\n",
    "authors = set()\n",
    "for commit in commits[:100]:\n",
    "    author_email = commit[\"author\"].split(\"<\")[1][:-1]\n",
    "    commit_date = datetime.strptime(\n",
    "        commit[\"date\"], \"%a %b %d %H:%M:%S %Y %z\"\n",
    "    ).strftime(\"%Y-%m-%d\")\n",
    "    commit_text = commit[\"change summary\"]\n",
    "    if commit[\"change details\"]:\n",
    "        commit_text += \"\\n\\n\" + commit[\"change details\"]\n",
    "    fixes = re.findall(r\"#(\\d+)\", commit_text, re.IGNORECASE)\n",
    "    nodes.append(\n",
    "        TextNode(\n",
    "            text=commit_text,\n",
    "            metadata={\n",
    "                \"commit_date\": commit_date,\n",
    "                \"author\": author_email,\n",
    "                \"fixes\": fixes,\n",
    "            },\n",
    "        )\n",
    "    )\n",
    "    dates.add(commit_date)\n",
    "    authors.add(author_email)\n",
    "\n",
    "print(nodes[0])\n",
    "print(min(dates), \"to\", max(dates))\n",
    "print(authors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a638f76a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:11,383 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    }
   ],
   "source": [
    "vector_store = PGVectorStore.from_params(\n",
    "    database=db_name,\n",
    "    host=url.host,\n",
    "    password=url.password,\n",
    "    port=url.port,\n",
    "    user=url.username,\n",
    "    table_name=\"metadata_filter_demo3\",\n",
    "    embed_dim=1536,  # openai embedding dimension\n",
    "    hnsw_kwargs={\n",
    "        \"hnsw_m\": 16,\n",
    "        \"hnsw_ef_construction\": 64,\n",
    "        \"hnsw_ef_search\": 40,\n",
    "        \"hnsw_dist_method\": \"vector_cosine_ops\",\n",
    "    },\n",
    ")\n",
    "\n",
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n",
    "index.insert_nodes(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15f7cf45",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:15,149 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
      "2025-09-11 16:48:15,687 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.\n"
     ]
    }
   ],
   "source": [
    "print(index.as_query_engine().query(\"How did Lakshmi fix the segfault?\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ab03ed4",
   "metadata": {},
   "source": [
    "#### Apply metadata filters\n",
    "\n",
    "Now we can filter by commit author or by date when retrieving nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa6212e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:31,673 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}\n",
      "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.vector_stores.types import (\n",
    "    MetadataFilter,\n",
    "    MetadataFilters,\n",
    ")\n",
    "\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n",
    "        MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n",
    "    ],\n",
    "    condition=\"or\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67c19ec6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:40,347 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}\n"
     ]
    }
   ],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-25\", operator=\"<=\"),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f6e9cdf",
   "metadata": {},
   "source": [
    "#### Apply nested filters\n",
    "\n",
    "In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.\n",
    "\n",
    "e.g. in SQL:\n",
    "```sql\n",
    "WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94f20be7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:45,021 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"
     ]
    }
   ],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilters(\n",
    "            filters=[\n",
    "                MetadataFilter(\n",
    "                    key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"\n",
    "                ),\n",
    "                MetadataFilter(\n",
    "                    key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"\n",
    "                ),\n",
    "            ],\n",
    "            condition=\"and\",\n",
    "        ),\n",
    "        MetadataFilters(\n",
    "            filters=[\n",
    "                MetadataFilter(key=\"author\", value=\"mats@timescale.com\"),\n",
    "                MetadataFilter(key=\"author\", value=\"sven@timescale.com\"),\n",
    "            ],\n",
    "            condition=\"or\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "737692ce",
   "metadata": {},
   "source": [
    "The above can be simplified by using the IN operator. `PGVectorStore` supports `in`, `nin`, and `contains` for comparing an element with a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85faf8b3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:49,129 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}\n"
     ]
    }
   ],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n",
    "        MetadataFilter(\n",
    "            key=\"author\",\n",
    "            value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n",
    "            operator=\"in\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ab9c333",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:51,587 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}\n",
      "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n",
      "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}\n",
      "{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}\n",
      "{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}\n"
     ]
    }
   ],
   "source": [
    "# Same thing, with NOT IN\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-01\", operator=\">=\"),\n",
    "        MetadataFilter(key=\"commit_date\", value=\"2023-08-15\", operator=\"<=\"),\n",
    "        MetadataFilter(\n",
    "            key=\"author\",\n",
    "            value=[\"mats@timescale.com\", \"sven@timescale.com\"],\n",
    "            operator=\"nin\",\n",
    "        ),\n",
    "    ],\n",
    "    condition=\"and\",\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"What is this software project about?\")\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a46764cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 16:48:56,822 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}\n"
     ]
    }
   ],
   "source": [
    "# CONTAINS\n",
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"fixes\", value=\"5680\", operator=\"contains\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"How did these commits fix the issue?\")\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87e6ba67c60cf7a2",
   "metadata": {},
   "source": [
    "### Customize queries\n",
    "\n",
    "It is possible to build more complex queries such as joining other tables. This is done by setting the `customize_query_fn` argument with your function. First, lets create a user table and populate it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60db78476c3abb63",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import (\n",
    "    Table,\n",
    "    MetaData,\n",
    "    Column,\n",
    "    String,\n",
    "    Integer,\n",
    "    create_engine,\n",
    "    insert,\n",
    ")\n",
    "\n",
    "engine = create_engine(url=connection_string + \"/\" + db_name)\n",
    "\n",
    "metadata = MetaData()\n",
    "\n",
    "user_table = Table(\n",
    "    \"user\",\n",
    "    metadata,\n",
    "    Column(\"id\", Integer, primary_key=True, autoincrement=True),\n",
    "    Column(\"name\", String, nullable=False),\n",
    "    Column(\"email\", String, nullable=False),\n",
    ")\n",
    "\n",
    "user_table.drop(engine, checkfirst=True)\n",
    "user_table.create(engine)\n",
    "\n",
    "with engine.begin() as conn:\n",
    "    stmt = insert(user_table)\n",
    "    conn.execute(\n",
    "        stmt, [{\"name\": \"Konstantina\", \"email\": \"konstantina@timescale.com\"}]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49a45abfeebe0d4e",
   "metadata": {},
   "source": [
    "Then, we can create a query customization function and instantiate `PGVectorStore` with `customize_query_fn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e740eba3df7a66a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any\n",
    "from sqlalchemy import Select\n",
    "\n",
    "\n",
    "def customize_query(query: Select, table_class: Any, **kwargs: Any) -> Select:\n",
    "    # Join the user table on the email addresses and add the name column to the select statement\n",
    "    return query.add_columns(user_table.c.name).join(\n",
    "        user_table,\n",
    "        user_table.c.email == table_class.metadata_[\"author\"].astext,\n",
    "    )\n",
    "\n",
    "\n",
    "vector_store = PGVectorStore.from_params(\n",
    "    database=db_name,\n",
    "    host=url.host,\n",
    "    password=url.password,\n",
    "    port=url.port,\n",
    "    user=url.username,\n",
    "    table_name=\"metadata_filter_demo3\",\n",
    "    embed_dim=1536,  # openai embedding dimension\n",
    "    hnsw_kwargs={\n",
    "        \"hnsw_m\": 16,\n",
    "        \"hnsw_ef_construction\": 64,\n",
    "        \"hnsw_ef_search\": 40,\n",
    "        \"hnsw_dist_method\": \"vector_cosine_ops\",\n",
    "    },\n",
    "    customize_query_fn=customize_query,\n",
    ")\n",
    "index = VectorStoreIndex.from_vector_store(vector_store=vector_store)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc3ee1d9de3113d9",
   "metadata": {},
   "source": [
    "We can then query the vector store and retrieve any additional field added to the select statement in a dictionary named `custom_fields` in the node metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "478f0d718dcea369",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-09-11 17:06:43,812 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912'], 'custom_fields': {'name': 'Konstantina'}}\n"
     ]
    }
   ],
   "source": [
    "filters = MetadataFilters(\n",
    "    filters=[\n",
    "        MetadataFilter(key=\"fixes\", value=\"5680\", operator=\"contains\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "retriever = index.as_retriever(\n",
    "    similarity_top_k=10,\n",
    "    filters=filters,\n",
    ")\n",
    "\n",
    "retrieved_nodes = retriever.retrieve(\"How did these commits fix the issue?\")\n",
    "for node in retrieved_nodes:\n",
    "    print(node.node.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b274ecb",
   "metadata": {},
   "source": [
    "### PgVector Query Options"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a490a0fa",
   "metadata": {},
   "source": [
    "#### IVFFlat Probes\n",
    "\n",
    "Specify the number of [IVFFlat probes](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options) (1 by default)\n",
    "\n",
    "When retrieving from the index, you can specify an appropriate number of IVFFlat probes (higher is better for recall, lower is better for speed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "111a3682",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = index.as_retriever(\n",
    "    vector_store_query_mode=\"hybrid\",\n",
    "    similarity_top_k=5,\n",
    "    vector_store_kwargs={\"ivfflat_probes\": 10},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6104ef8d",
   "metadata": {},
   "source": [
    "#### HNSW EF Search\n",
    "\n",
    "Specify the size of the dynamic [candidate list](https://github.com/pgvector/pgvector?tab=readme-ov-file#query-options-1) for search (40 by default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3a44758",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = index.as_retriever(\n",
    "    vector_store_query_mode=\"hybrid\",\n",
    "    similarity_top_k=5,\n",
    "    vector_store_kwargs={\"hnsw_ef_search\": 300},\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
